Intra- and inter-rater reliability, agreement, and minimal detectable change of the handheld dynamometer in individuals with symptomatic hip osteoarthritis

Introduction The handheld dynamometer has been validated to measure muscle strength in different muscle groups. However, to date, it has not been tested in individuals who experience pain induced by hip osteoarthritis. The current study aimed to evaluate the intra- and inter-rater reliability, agreement, and minimal detectable change of the Lafayette handheld dynamometer, model 1165, to assess the peak force (Pk) and average peak force (Af) of hip muscles in individuals with symptomatic hip osteoarthritis. Methods Twenty participants with hip osteoarthritis (mean ± SD age: 58.7±15.3 years; body mass index: 28.8±4.2 kg/m2) and pain intensity on the Visual Analogue Scale ≥ 4 (8.05±1.2) were recruited to participate in this study. Pk and Af of hip flexors (seated position), abductors and adductors (supine position), and extensors (prone position) were collected in a single day by two independent raters, each one obtaining test and retest in randomly ordered separate sessions. Results The intra-rater intraclass correlation coefficient (ICC) was classified as good (>0.75) or excellent (≥0.90) for all muscle groups and all inter-rater ICCs were classified as excellent. Rater A had a lower standard error of measurement compared to rater B, ranging from 0.15 to 0.58 kilogram-force (Kgf) compared with 0.34 to 1.25 kg, respectively. However, the inter-rater comparison showed a minimal detectable change (MDC) of < 10% for all Pk and Af measures for hip adductors and extensors. Finally, the inter-rater Bland-Altman analysis demonstrated good agreement for abductors, adductors, and extensors. Conclusion Despite pain and dysfunction related to hip osteoarthritis, the mean of two measures using a handheld dynamometer was shown to be a reliable tool to assess hip muscle strength, with good to excellent intra- and inter-rater ICCs, satisfactory agreement, and small values for MDC.


Introduction
The handheld dynamometer has been validated to measure muscle strength in different muscle groups. However, to date, it has not been tested in individuals who experience pain induced by hip osteoarthritis. The current study aimed to evaluate the intra-and inter-rater reliability, agreement, and minimal detectable change of the Lafayette handheld dynamometer, model 1165, to assess the peak force (Pk) and average peak force (Af) of hip muscles in individuals with symptomatic hip osteoarthritis.

Methods
Twenty participants with hip osteoarthritis (mean ± SD age: 58.7±15.3 years; body mass index: 28.8±4.2 kg/m 2 ) and pain intensity on the Visual Analogue Scale � 4 (8.05±1.2) were recruited to participate in this study. Pk and Af of hip flexors (seated position), abductors and adductors (supine position), and extensors (prone position) were collected in a single day by two independent raters, each one obtaining test and retest in randomly ordered separate sessions.

Results
The intra-rater intraclass correlation coefficient (ICC) was classified as good (>0.75) or excellent (�0.90) for all muscle groups and all inter-rater ICCs were classified as excellent. Rater A had a lower standard error of measurement compared to rater B, ranging from 0.15 to 0.58 kilogram-force (Kgf) compared with 0.34 to 1.25 kg, respectively. However, the interrater comparison showed a minimal detectable change (MDC) of < 10% for all Pk and Af

Introduction
Hip Osteoarthritis (OA) is an end-stage disease from various causes, resulting in chronic hip pain, dysfunction, and stiffness. It is estimated that symptoms are present in 5 to 10% of adults older than 40 and 45 years, considering the Spanish and American populations, respectively, with a higher prevalence with increasing age [1,2]. Chronic hip pain is associated with muscle atrophy and weakness, as demonstrated in a meta-analysis conducted with pooled data from thirteen articles. Collectively, the authors observed a reduction in muscle strength in individuals with osteoarthritis that mainly affects hip flexors (-22%) and hip extensors (-21%), or abductors (-31%) and adductors (-25%) compared to healthy control groups [3]. Muscle weakness and atrophy seem to have a central role in the dysfunction related to hip osteoarthritis, as demonstrated by imaging studies and isometric dynamometry [3][4][5]. Both are deeply connected to the degree of radiographic OA classification, and their progress should be avoided through participation in exercise programs that include aerobic and strengthening exercises [6,7]. However, a reliable and easy method seems to be necessary to measure the strength of hip muscles in the clinical routine, in order to monitor disease and treatment progression in the rehabilitation process.
Measurement of Peak Force (Pk) has been considered the gold standard method for isokinetic test parameters to evaluate muscle function [8] and can be acquired with fixed or portable dynamometry. Handheld dynamometers (HHD) have been suggested as a practical, feasible, and simple tool to assess isometric lower limb muscle strength in the clinical setting [9] compared to fixed laboratory-based dynamometry, such as isokinetic dynamometers [10,11]. In addition, manual dynamometers require little training for proficient application [10,12] and have lower costs than fixed laboratory-based dynamometry [13,14].
Several studies have validated and recommended the use of different HHDs to measure hip muscle strength, with good to excellent intraclass correlation coefficients (ICC). The validity of two HHDs compared with a fixed laboratory-based dynamometer was previously demonstrated [11], with good to excellent reliability, particularly for proximal muscle groups in the lower limbs of health subjects. Other author also found good to excellent reliability when evaluating hip flexors and adductors of young adult football players [15], or when evaluating young, healthy adults with similar results when testing hip and knee strength [16]. The literature assessed different hip muscles in various protocols and also recommended the use of manual devices to acquire hip muscle strength in healthy subjects [9,17,18]. Only one study tested the use of the HHD in older people (> 65 years old) and found good reliability for hip and knee muscle strength measures, without specifying lower limb articular disease [10]. There are no definitive findings about the reliability of HHD measurements in participants with symptomatic hip OA. Accordingly, it is crucial to determine if pain intensity related to hip OA could affect the reliability of HHD in muscle strength evaluations of the hip, since comfort may be a potential limitation for strength performance [12]. Furthermore, the standard error of measurement (SEM) and minimal detectable change (MDC) need to be determined to allow comparability for routine measurements in clinical settings of symptomatic hip OA patients. The purpose of this methodological study was to analyze the reliability, agreement, and minimal detectable change of an HHD in individuals with chronic hip pain related to OA. We hypothesized that HHD could be a reliable tool to measure muscle strength for hip muscles even if symptomatic hip OA is present. Our findings could help clinicians and physical therapists to design more rational assessment strategies for individuals with chronic hip pain related to OA, using a tool that requires little training, is low cost, and minimizes the time needed by patients and clinicians.

Study design
A methodological study with repeated measures was conducted to determine the intra-and inter-rater reliability, agreement, and MDC for strength assessment of hip muscles obtained with the HHD, testing subjects who experience chronic hip pain. Participants were assessed in a single-day session. Data collection occurred between August 2021 and March 2022 after approval by the local Ethics Committee (CAAE 40347320.1.1001.0025), following the Helsinki Declaration of 1975. All participants signed an informed consent before data collection. The research was conducted at the Hospital das Forças Armadas (Brasília, Brazil) and Instituto Hospital de Base (Brasília, Brazil), following the guidelines for reporting reliability and agreement studies (GRRAS) [19].

Participants
Twenty participants {40% female, mean age 58.7 (± 15.3) years, age range = 41-79 years, body mass index = 28.8 (± 4.2) kg/m 2 } with symptomatic hip OA were enrolled in the present study from the Orthopedic Department of two tertiary hospitals. Eligibility and demographic data were obtained using an interview questionnaire formulated by the authors. Study procedures were explained to potential participants, and they were assigned to the study protocol if eligible and after giving written informed consent. Participants were included if they presented hip OA radiographically classified as type II (Definite osteophytes, possible joint space narrowing), III (moderate osteophytes, definite joint space narrowing, some sclerosis, possible bone-end deformity), or IV (Large osteophytes, marked joint space narrowing, severe sclerosis and definite bone ends deformity) according to the Kellgren and Lawrence classification [20,21], performed by rater B. All participants had previously been screened with x-ray images as part of their usual care, and no additional image investigation was performed. Other causes of the reported pain, lower limb and back, were also excluded as the primary source of pain, and range of motion was tested to guarantee the hip as the source of the symptoms.

Instruments
Pain intensity was assessed using the Visual Analogue Scale (VAS), with faces ranging from 0 to 10, presented to the participants at the eligibility interview and after each protocol sequence of muscle strength assessment [22,23]. The Western Ontario and McMaster Universities Index (WOMAC) was used as a Health-related Quality of Life (HrQOL) questionnaire developed for patients with hip and knee OA as a self-reported tridimensional scale. The questionnaire evaluates pain, function, and joint stiffness (five questions for the subscale of pain, two questions for the subscale of stiffness, and seventeen questions for the subscale of function). Answer options are presented on a 5-point Likert scale. The total possible score ranges from 0 to 96; the fewer points scored, the better the patient's HrQOL [24,25]. Lastly, to characterize the severity of hip OA, the Harris Hip Score (HHS) was applied by one of the examiners (rater A) to evaluate four domains: Pain (0-44 points), function (0-47 points), absence of deformity (0 or 4 points), and mobility (0-5 points). Scores range from 0 to 100, with higher scores demonstrating less compromised hip joints [26][27][28].
Procedures The HHD Lafayette Manual Muscle Testing System Model-01165 (Lafayette Instrument Company, Lafayette IN, USA) was used to assess hip muscle strength during a three-second maximal effort, following the protocol sequence: hip flexors (seated position), hip abductors, and adductors (supine, long-lever), and hip extensors (prone, long-lever) performed on a regular examination table and collected on the same day by both raters. This time frame was chosen considering the clinical context, that individuals with symptomatic hip OA and older adults have difficulties in moving around, which could affect adherence to a second day of evaluation. We assumed that patients would be more interested in participating in the study protocol if measurements were taken on the same day as their regular medical evaluation. In addition, our protocol aimed to mimic the clinical routine of physicians and physiotherapists when evaluating their patients, reproducing a more realistic scenario to be adopted in practice [19,29].
Participant and rater positions have been described elsewhere [11], with some minor modifications. Hip flexors ( Fig 1A) were evaluated with the participant on an examination table, seated with both legs hanging off the table and arms positioned at the sides of the body, and both knees and hips at 90˚. The assessor was placed right in front of the affected lower limb, holding the HHD with both hands at the anterior aspect of the thigh, 1 to 2 cm above the superior edge of the patella. Participants were instructed to push against the HHD, trying to flex the hip with the maximal force for three seconds. Hip abductors ( Fig 1B) were tested in the supine position, hands crossed in front of the chest, hip and knee at 0˚, with the assessor

PLOS ONE
standing by the side of the examination table and holding the HHD with both hands above the lateral malleolus (long-lever), using their own body to stabilize it. Similarly, the participant tried to abduct the affected hip against the HHD. Hip adductors ( Fig 1C) were evaluated with the participant in the same position, but now, with the HHD held above the medial malleolus (long-lever) and the examiner placing their knee in the middle of both participants' ankles. In this situation, the participant was encouraged to adduct only the affected leg. Finally, the participant was instructed to lie in the prone position to evaluate hip extensors (Fig 1D), arms crossed under the forehead, hip and knee at 0˚. The rater stood immediately in front of the end of the table, holding the device with both hands, elbows extended, at 3-4 cm above the posterior calcaneal tuberosity (long-lever), followed by an attempt to extend the hip while maintaining the knee at full extension. All participants were advised not to flex the knee during hip extension.
Before every protocol sequence of muscle strength assessment, participants were instructed to push against the HHD with their maximum force for three seconds and were reminded that the test starts as they push the HHD and hear a single sound alarm and finishes as they hear a double sound alarm. A submaximal strength test trial was performed in the seated position with the non-affected limb to familiarize the participant with how the device works and the sound alarm. One demonstration was also performed in the supine and prone position to clarify how the test could be performed if required [10,11]. None of the participants had any previous familiarity with this device.
Two independent raters performed data collection, both physicians (V.G.F.; F.F.F) with no experience with the HHD. Raters were allowed to practice the measurements protocol sequence for four months. Data were registered using REDCap (Research Electronic Data Capture) electronic data capture tools hosted at Instituto Hospital de Base [30,31]. Each rater repeated measurements twice on the same day. To minimize any possible effect of cumulative pain resulting from test-retest, the order of data collection was defined using a randomized sequence generated on the website sealedenvelope.com (proportion of 1:1, in blocks of four). Participants were allowed to rest between each protocol sequence until they felt comfortable to start the next round [14]. The VAS for pain was measured after each sequence. Participants were given continuous encouragement to push harder against the HHD to obtain maximal isometric force during the 3 seconds of each test [11,14].

Statistical analysis
Descriptive statistics were used to describe participants' sociodemographic characteristics. The Shapiro-Wilk test was performed to confirm the normal distribution of the data. The Paired ttest was used to compare VAS for pain intensity between intra-and inter-rater measures. Assessment of intra-and inter-rater reliability regarding Pk and Af measures was conducted using the ANOVA 2-way random model, with a Confidence Interval of 95% (95%CI), to compare test-retest measures for intra-rater analysis, and the mean of test-retest for inter-rater analysis. To categorize the reliability between repeated measures, we assessed the intraclass correlation coefficient (ICC 2,1 ), and the correlation between measures was classified as poor (ICC < 0.5), moderate (0.5 � ICC < 0.75), good (0.75 � ICC < 0.90), and excellent (ICC � 0.90) [11,32]. To define the presence of bias in the data and establish the Limit of agreement (LoA) between raters, mean values considering the two measures were plotted with a 95% CI using the Bland-Altman (BA) method [33,34] [35,36]. Statistical significance was assumed when p < 0.05. All statistical analyses were performed using SPSS version 25 (IBM Corp., Chicago, IL, USA), and the BA graphs were plotted by GraphPad Software (San Diego, CA, USA). The sample size was calculated using an acceptable ICC of 0.70, an expected ICC of 0.90, and assuming an α of 5% and power of 80%, with a drop-out rate of 10%, resulting in a minimal sample of 20 participants [37].

Results
The demographic data of the sample are shown in Table 1. Approximately 85% of the participants presented a defined joint space reduction associated with sclerosis and moderate to severe osteophytes (types III/IV), representing the whole spectrum of substantial alterations in the x-ray related to osteoarthritis. Considering all daily activities during the week before inclusion in the study, the pain intensity (VAS; mean ± SD) was 8.05±1.2, 95%CI {7.47-8.62}, and together with an HHS score of 50.2±20.1 and a WOMAC score of 63.5±14.0, the data show considerable pain, dysfunction, and a reduction in quality-of-life related to hip OA.
A statistical difference in VAS (mean±SD, 95%CI) was observed for pain intensity after test and retest for rater A (Test: 6 [38] and the minimal clinically important difference in pain for hip osteoarthritis of 24mm and 30mm regarding a baseline VAS interval of 50 -65mm and >65mm, respectively [39,40], our result did not reach a meaningful change for intra-or interrater VAS between trials. Table 2 shows the mean ± SD values of test-retest Pk and Af, relative reliability expressed as ICC 2,1 , absolute reliability expressed as SEM, and MDC 95 for the four major hip muscle groups, comparing intra-and inter-rater reliability.
The SEM ranged from 0.15 to 0.58Kgf (kilogram-force) for rater A and 0.34 to 1.25kgf for rater B, with rater A being more consistent in the test-retest measurements of Pk and Af for flexor, abductor, and adductor hip muscles. In addition, rater A obtained smaller values of MDC when considering all flexor, abductor, and adductor muscles for Pk and Af measures. This difference between raters was more pronounced in the flexors muscle group, which presented the highest mean values of strength for both raters in the test-retest measurements.
Nevertheless, when we consider the mean of the two measures in the inter-rater analysis of relative reliability, all ICCs for both Pk and Af were classified as excellent (�0.90) with good precision, expressed by the 95% CI; the smallest value was found for Abductor Af (ICC = 0.913; 95%CI{0.774-0.967}) and the highest value for Adductor Af (ICC = 0.983; 95% CI {0.955-0.993}). The absolute reliability found for Pk ranged from 0.14 to 0.28kgf, and for Af, it ranged from 0.09 to 0.42kgf, with better consistency for adductor, followed by extensor, abductor, and flexor muscle groups for both measures. These results of MDC% (95%CI) were smaller than 10% for all Pk measures analyzed, which may reflect a satisfactory parameter when comparing the mean of two measures between different raters.
The Bland-Altman plot (Fig 2) shows the distribution of the differences in mean values between raters (A-B) versus the mean of all measures. The differences were well distributed for

PLOS ONE
abductor, adductor, and extensor muscle groups, demonstrated by the low bias for Pk and Af, with the lowest tendency of disagreement for hip adductors (Pk bias = 0.10 {LoA -2.69 to 2.90}, Fig 2E and Af bias = 0.02 {LoA -1.97 to 2.03}, Fig 2F), followed by hip abductors (Pk bias = -0.2 {LoA -3.05 to 2.63}, Fig 2C and Af bias = -0.28 {LoA -3.99 to 3.41}, Fig 2D), and hip extensors (Pk bias = -0.51 {LoA -5.49 to 4.47}, Fig 2G and Af bias = 0.47 {LoA -2.23 to 3.19}, Fig 2H). The regression line did not show a statistically significant difference in the proportional error for those muscle groups. On the other hand, hip flexor bias demonstrated that differences in measures for rater A for Pk were, on average 0.91 Kgf higher than for rater B (Pk bias = 0.91 {LoA -2.93 to 4.73}, Fig 2A); and the differences in measures for Af were on average 0.95kgf higher than rater B (Af bias = 0.95 {LoA -3.22 to 5.13}, Fig 2B). These higher values seem to be related to a tendency of rater A to measure higher values, with increased mean flexor strength when compared to rater B. The regression analysis showed significant deviations from zero for Pk (p = 0.01) and Af (p = 0.01) in the positive direction, with a higher proportional error for rater B.

Discussion
To our knowledge, this is the first study to assess the use of an HHD in a clinical population with symptomatic hip osteoarthritis, considering the degree of radiographic impairment and pain related to the disease. Our study was designed to reproduce a clinical situation where repeated strength measures could be collected easily in a viable routine rather than a laboratory study design. We demonstrated that the Lafayette HHD is a reliable instrument to evaluate hip muscle strength in this population, with good to excellent intra-and inter-rater reliability, satisfactory consistency, and minimal differences in the intra-rater and inter-rater analyses. Thus, clinicians can use the HHD to evaluate disuse or treatment effects on muscle strength in symptomatic hip OA patients.
Previous studies demonstrated that considering the lower limb musculature, the hip presented the strongest validity and reliability for measures of peak force, comparing the same HHD and a fixed dynamometer. Excellent reliability was also found when comparing the HHD applied by a rater or a belt system. Nevertheless, both these studies evaluated healthy and active subjects, and the authors suggest caution with generalization for the clinical population [11,16]. Just one study assessed the HHD reliability for lower limb strength in older individuals (over 65 years old), including participants with hip and knee OA, demonstrating good intra-and inter-rater reliability for hip and knee muscle strength assessments [10]. However, only~60% of the participants included in that study have hip or knee OA, and the descriptions of the pain and source of symptoms were poorly characterized, which makes comparisons between our results and those of Arnold and colleagues [10] difficult.
Interestingly, the present study demonstrated that the participants present good tolerance for the time taken to perform the measurements (3 seconds), even when pain was also perceived. Collectively, these data also corroborate previous results concerning older adults [10], suggesting that even when the articular disease is present in the lower limb, notably hip OA, the reliability of the HHD is satisfactory to recommend this instrument as a tool for clinical assessment. We also provide adequate information about the characteristics of the participants' hip OA, making it clear how much pain, dysfunction, and reduction in quality of life could be associated with the disease, in order to define more precisely the population of interest in this study. Despite the participants experiencing pain when performing the test protocol, the HHD test demonstrated good to excellent ICC, raising the question of the interference of patient discomfort as a potential limitation to performing tests with enough reliability, as suggested in the literature [12].
Rater A had a better correlation between test-rest measures when compared to rater B for all muscle group measurements for Pk and Af, notably in the flexors group. These results may be explained by the difference in anthropometric measurements of the raters and their presumed strength (1.80m and 85kg versus 1.69m and 68kg), demonstrated previously in the literature as a factor that could influence HHD measurements [1,17,41]. It is possible the use of a stabilization belt system, particularly for hip flexors, could help solve this problem, given that it does not depend on the examiner's strength [17,42]. However, there are conflicting data in the literature regarding the advantage of belt stabilization for HHD, since this device does not provide a stabilization belt [16]. Adaptations to stabilize the device and the lack of a proper method of fixation could interfere with measurements and should be further tested and validated before any recommendations are made.
The most reliable muscle strength measurement was found for hip adductors, followed by extensors and adductors, demonstrated by excellent values of ICC and an adequate 95% CI, ranging from good to excellent reliability values. An exception was observed for intra-rater B reliability, who, despite showing good ICCs for Pk (ICC = 0.851, 95% CI {0.612-0.942}) and Af (ICC = 0.761, 95% CI {0.318-0.908}), presented a wide range of 95% CI, that could be explained by the stronger participants who had larger differences between test-retest for both raters. This result agrees with Kelln and colleagues (2008), who demonstrated that stronger muscles present wider differences in test-retest evaluations. Our data also suggest that the muscle strength assessment would be more feasible in situations with muscle weakness [11,42,43], expressed by the low SEM values in the inter-rater analysis.
The MDC% (95% CI) calculated in the intra-rater analysis was smaller for rater A (ranging from 5.89 to 15.13%) than for rater B, who demonstrated a much wider interval (13.19 to 35.14%). However, when using the mean of two measures for the inter-rater analysis, values of MDC% were considerably reduced, by around 8%, suggesting that at least two measurements should be taken to improve the MDC% and reduce random errors. Values under 10% are considered an adequate parameter to express any real difference instead of a random error of measurement, according to Prentice et al. (2004( , quoted in Chamorro et al., 2017. Our protocol seems to be adequate for clinical purposes, since it can detect small variations that could be attributed to a real difference. Averaging two measures seems to be sufficient to reduce the variability that may result from the measuring instrument, raters, or characteristics of the measure taken, aligned with the theoretical assumption that an average score would better estimate the true value, minimizing the effect of random error [32]. This is consistent with a practical protocol of measurements that could easily be reproduced in a clinical scenario, capable of minimizing time requirements and reducing discomfort/pain from repeated strength tests in a compromised joint, such as a hip with osteoarthritis. Although MDC has been considered worthwhile to screen patient progression with good precision, future studies should consider economic evaluations of screening strategies concerning HHD assessment, with many specific challenges to overcome [44].
The Bland-Altman inter-rater analysis demonstrated small values of bias for abductors, adductors, and extensors when considering the mean of the test and retest. There was a reasonable agreement with a low bias for both variables, Pk and Af, for all muscle groups evaluated, with a tendency to a proportional error only for flexors when comparing raters. However, the LoA demonstrated a large range of fixed error, especially for flexors and extensors. Future studies should evaluate the influence of experience and routine practice on the LoA fixed error range when using this device.
Some limitations should be addressed in our study. We did not perform measures on different days and in different positions, so the conclusions raised here should be restricted to conditions that replicate this protocol and compared with caution when considering studies performed in a different setting. With respect to raters, the experience level of both raters was the same; the inclusion of raters with different levels of expertise and practice with this instrument would reflect a more realistic scenario. The rater's ability to resist hip strength is a very relevant point that could interfere with the reproducibility of measurements [9,12]. Considering that rater B, who weighs 68kg, had some difficulty stabilizing the HHD for hip flexor measurements, we suggest that lighter raters should be intensively trained to achieve better consistency and to rigorously follow the standardized protocol, since it is possible that knowledge of biomechanics and positioning may overcome the influence of his/her body weight and presumed strength [11,45]. Furthermore, the sample size did not allow further analysis of the subgroup related to hip osteoarthritis classification, and the relation between radiographic impairment and HHD reliability may not be inferred from our results. Future studies are needed to evaluate the reliability of the HHD in other clinical situations, such as knee osteoarthritis.

Conclusion
The HHD is a reliable method to evaluate hip muscle strength in individuals with symptomatic hip OA, with good to excellent intra-and inter-rater reliability and low values of SEM, even in the presence of pain related to the disease. The mean of at least two measures provides values with satisfactory agreement and reliability between raters, with adequate precision in an easily applied protocol. This study also provided values for the MDC, which could help to define a threshold to quantify improvements or reductions in hip muscle strength during treatment interventions or evaluation of disease progression with a low-cost, portable, and useful tool that requires little training for routine patient care assessment.